18/02/2019

Big Data?

Class Survey

Posted on flickr by BBVAtech in 2012, by Asigra [CC BY 2.0](https://creativecommons.org/licenses/by/2.0/)


Expert Survey (UC Berkeley, 2014)

  • Ask 40 experts to define "big data"
  • … get 40 different definitions :)

Expert Survey (UC Berkeley, 2014)

Expert Survey: Example 1

"Big Data is the result of collecting information at its most granular level — it’s what you get when you instrument a system and keep all of the data that your instrumentation is able to gather."

Jon Bruner (Editor-at-Large, O’Reilly Media)

Expert Survey: Example 2

"Big data is data that contains enough observations to demand unusual handling because of its sheer size, though what is unusual changes over time and varies from one discipline to another."

Annette Greiner
(Lecturer, UC Berkeley School of Information)

Expert Survey: Example 3

"[…] 'big data' will ultimately describe any dataset large enough to necessitate high-level programming skill and statistically defensible methodologies in order to transform the data asset into something of value."

Reid Bryant
(Data Scientist, Brooks Bell)

Conclusion

  • Large amounts of data
  • Various types/formats of data
  • Unusual sources
  • Speed of data flow/stream
  • Use programming and statistics (in a broad sense) to extract value

'Learn Big Data'?

Domains Affected

  • How to design/set-up the machinery to handle large amounts of data?
  • How to use the existing machinery most efficiently for large amounts of data?
  • How to approach the analysis of large amounts of data with statistics?

Focus in This Course

  • How to design/set-up the machinery to handle large amounts of data?
  • How to use the existing machinery most efficiently for large amounts of data?
  • How to approach the analysis of large amounts of data with statistics?
    1. Compute 'usual' statistics based on large dataset (many observations).
    2. Multivariate Statistics to gain insights from Big Data (many variables).

Big Data in Scientific Research

Big Data in the Sciences

  • Mother Nature has always provided the data, but…
    • … instruments have gotten more precise
    • … new measurement methods have been developed
  • Prominent examples: Astronomy, Genomics/Bioinformatics
Photo by Joe Parks, [(CC BY-NC 2.0)](https://creativecommons.org/licenses/by-nc/2.0/) source: https://flic.kr/p/e2umhv


Big Data in the Social Sciences

  • Hardware: Diffusion of the Internet and mobile-phone networks.
  • Software: Web 2.0 Technologies (APIs, JSON, Programmable Web, etc.).
    • Backbone of social media and many prominent web services (e.g., Google Maps).
    • Data integration across platforms and services.
    • Exchange of data between/across applications.
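The JSON-based data exchange mentioned above can be illustrated directly in R. A minimal sketch, assuming the `jsonlite` package is installed; the JSON string and its values are invented for illustration:

```r
# Parse a small JSON document into an R data frame.
# The data shown here are made up for illustration only.
library(jsonlite)

json <- '[
  {"date": "2019-01-01", "price": 101.2},
  {"date": "2019-01-02", "price": 102.7}
]'

# fromJSON() turns a JSON array of objects into a data.frame
df <- fromJSON(json)
df
```

fromJSON() also accepts a URL, which is how data is typically pulled from a web API into R.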

Big Data in the Social Sciences/Economics

Source: Bollen, Mao, and Zeng (2011)

Big Data in the Social Sciences/Economics

Source: Ranco et al. (2015)

Big Data in the Social Sciences/Economics

  • Often tied to web applications and digitization of economic and political processes.
  • Volume of data is substantial (but usually smaller than in the natural sciences).
  • Variety and variability are often more challenging than in the natural sciences.
    • Various sources.
    • Data generation/sensors are independent of the research endeavor.
  • Questions/problems are often similar to applied research in industry.
    • Key difference: usually no streaming applications (velocity is less of an issue).

This Course

Two Parts

  1. Big Data: Basic Concepts and Applications in R (Ulrich Matter)
  2. Multivariate Statistics in Python (Matthias Fengler)

Objectives

‐ Understand the concept of Big Data in the context of economic research.

‐ Understand the technical challenges of Big Data Analytics and how to deal with them in practice.

‐ Know the basic statistical techniques of clustering, dimensionality reduction, and factor models.

Schedule: Part I

  1. Introduction: Big Data, Data Economy (Concepts). M: Walkowiak (2016): Chapter 1
  2. Programming with Data, R Refresher Course (Concepts/Applied). M: Walkowiak (2016): Chapter 2
  3. Computation and Memory (Concepts)
  4. Cleaning and Transformation of Big Data (Applied). M: Walkowiak (2016): Chapter 3, pp. 74-118
  5. Aggregation and Visualization (Applied: data.table, ggplot). M: Walkowiak (2016): Chapter 3, pp. 118-127. C: Wickham et al. (2015), Schwabish (2014)
  6. Distributed Systems, MapReduce/Hadoop with R (Concepts/Applied). M: Walkowiak (2016): Chapter 4
  7. Data Storage, Database Interaction with R. M: Walkowiak (2016): Chapter 5

Schedule: Part II

  1. Multivariate Random Variables and Distributions. M: Härdle, Simar (2015): Chapters 4-5
  2. Clustering. M: Härdle, Simar (2015): Chapter 13
  3. Principal Component Analysis. M: Härdle, Simar (2015): Chapter 11
  4. Factor Models. M: Härdle, Simar (2015): Chapter 12
  5. Summary/Q&A

Examination: Part I

  • Decentral: graded group 'paper' (all group members receive the same grade) (50%).
  • Group size: 2 to 3 people.
  • Take-home exercises (group task): application of basic concepts in R when working with big data, plus conceptual questions related to the application.
  • Hand-in: two weeks after the semester break!
(More details next week)

Examination: Part II

  • Decentral: oral examination (individual) (50%, 15 min.).
  • The oral exam covers multivariate statistics (methods/concepts).

This Part of the Course

Approach to Big Data Statistics

  1. Analyze the underlying problem.
  2. Understand the underlying concept.
  3. Apply the understanding in a hands-on exercise.

R used in two ways

  • A tool to analyze problems posed by large datasets.
    • For example, memory usage (in R).
  • A practical tool for Big Data Analytics.

Example

Preparations

# read dataset into R
economics <- read.csv("../data/economics.csv")
# have a look at the data
head(economics, 2)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
# create a 'large' dataset out of this
for (i in 1:3) {
     economics <- rbind(economics, economics)
}
dim(economics)
## [1] 4592    6

Example

Compute the real personal consumption expenditures (pce): Divide each value of pce by the deflator 1.05.

# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- c()
n_obs <- length(economics$pce)
for (i in 1:n_obs) {
  pce_real <- c(pce_real, economics$pce[i]/deflator)
}

# look at the result
head(pce_real, 2)
## [1] 483.2381 486.1905

Example

How long does it take?

# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- list()
n_obs <- length(economics$pce)
time_elapsed <-
     system.time(
         for (i in 1:n_obs) {
              pce_real <- c(pce_real, economics$pce[i]/deflator)
})

time_elapsed
##    user  system elapsed 
##   0.082   0.004   0.096

Example

Assuming a linear-time algorithm (\(O(n)\)), this is how much time we need for each additional row of data:

time_per_row <- time_elapsed[3]/n_obs
time_per_row
##      elapsed 
## 2.090592e-05

Example

If we deal with big data, say 100 million rows, that is

# in seconds
(time_per_row*100^4) 
##  elapsed 
## 2090.592
# in minutes
(time_per_row*100^4)/60 
##  elapsed 
## 34.84321
# in hours
(time_per_row*100^4)/60^2 
##   elapsed 
## 0.5807201

Example

What happens in the background?

  • Evaluation/computation
  • Memory allocation/deallocation
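The memory side can be made visible with base R's tracemem(), which prints a message whenever the traced object is duplicated. A minimal sketch of R's copy-on-modify behavior, one of the drivers of allocation/deallocation cost (tracemem() requires R compiled with memory profiling, which is the default for CRAN binaries; the numeric values are illustrative):

```r
x <- c(507.4, 510.5)  # illustrative values
tracemem(x)           # start tracing copies of x

y <- x                # no copy yet: x and y point to the same memory
y[1] <- 0             # now y must be duplicated (copy-on-modify);
                      # tracemem reports the copy to the console
untracemem(x)         # stop tracing
```

The same mechanism is at work in the naive loop above: repeatedly growing an object forces R to allocate new memory and copy the old contents over and over.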

Example

Can we improve this?

# Improve memory allocation (still somewhat ignorant of R)
deflator <- 1.05 # define deflator
n_obs <- length(economics$pce)
pce_real <- list()
# allocate memory beforehand
# tell R how long the list will be
length(pce_real) <- n_obs
# iterate through each observation
time_elapsed <-
     system.time(
         for (i in 1:n_obs) {
              pce_real[[i]] <- economics$pce[i]/deflator
})

time_elapsed
##    user  system elapsed 
##   0.019   0.000   0.019

Example

Any improvements?

time_per_row <- time_elapsed[3]/n_obs
time_per_row
##      elapsed 
## 4.137631e-06

Example

# in seconds
(time_per_row*100^4) 
##  elapsed 
## 413.7631
# in minutes
(time_per_row*100^4)/60 
##  elapsed 
## 6.896051
# in hours
(time_per_row*100^4)/60^2 
##   elapsed 
## 0.1149342

This looks much better, but we can do even better…

Example

Can we improve this?

# Do it 'the R way'
deflator <- 1.05 # define deflator
# Exploit R's vectorization!
time_elapsed <- 
     system.time(
     pce_real <- economics$pce/deflator
          )
# same result
head(pce_real, 2)
## [1] 483.2381 486.1905
# but much faster!
time_elapsed
##    user  system elapsed 
##       0       0       0
time_per_row <- time_elapsed[3]/n_obs

Example

In fact, system.time() is not precise enough to capture the time elapsed…

# in seconds
(time_per_row*100^4) 
## elapsed 
##       0
# in minutes
(time_per_row*100^4)/60 
## elapsed 
##       0
# in hours
(time_per_row*100^4)/60^2 
## elapsed 
##       0
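When a single run is too fast for system.time() to resolve, a common approach is to time many repetitions and look at the distribution. A minimal sketch, assuming the microbenchmark package is installed (not part of the course code; the simulated pce values are illustrative):

```r
library(microbenchmark)

deflator <- 1.05
pce <- rnorm(1e4, mean = 500, sd = 10)  # simulated values for illustration

# run each expression 20 times and record nanosecond-level timings
mb <- microbenchmark(
  loop = {
    out <- numeric(length(pce))               # pre-allocated result
    for (i in seq_along(pce)) out[i] <- pce[i] / deflator
  },
  vectorized = pce / deflator,
  times = 20
)
mb  # prints summary statistics per expression
```

Unlike system.time(), microbenchmark() uses a high-precision timer, so even the vectorized division gets a measurable (sub-millisecond) timing.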

What do we learn from this?

  1. How R allocates and deallocates memory can have a substantial effect on computation time.
    • (Particularly when we deal with a large dataset!)
  2. How the computation is implemented can matter a lot for the time elapsed.
    • (For example, loops vs. vectorization/apply.)
More on this in Lecture 3
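Between an explicit loop and full vectorization sits the apply family. A minimal sketch using vapply(), which pre-allocates its result internally and checks the output type (the pce values are the illustrative figures from the example above):

```r
deflator <- 1.05
pce <- c(507.4, 510.5)  # first two pce values from the example

# vapply handles the result allocation for us and enforces that
# each element of the output is a single numeric value
pce_real <- vapply(pce, function(p) p / deflator, numeric(1))
pce_real
```

For simple element-wise arithmetic like this, plain vectorization (pce / deflator) remains the fastest and most idiomatic option; the apply family is most useful when no vectorized equivalent exists.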

Resources for Part I

Literature

Notes, Slides, Code, et al.

Suggested Learning Procedure

  • Clone/fork the course's GitHub repository.
  • During class, use the Rmd file of the slide set as the basis for your notes.
  • After class, enrich/merge/extend your notes with the lecture notes.

TODO (for next week!)

References

Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. “Twitter Mood Predicts the Stock Market.” Journal of Computational Science 2 (1): 1–8. doi:10.1016/j.jocs.2010.12.007.

Ranco, Gabriele, Darko Aleksovski, Guido Caldarelli, Miha Grčar, and Igor Mozetič. 2015. “The Effects of Twitter Sentiment on Stock Price Returns.” PLOS ONE 10 (9): 1–21. doi:10.1371/journal.pone.0138441.